Abstract

Works, briefly surveyed here, are concerned with two basic methods: Maximum Probability and Bayesian Maximum Probability; as well as with their asymptotic instances: Relative Entropy Maximization and Maximum Non-parametric Likelihood. Parametric and empirical extensions of the latter methods - Empirical Maximum Maximum Entropy and Empirical Likelihood - are also mentioned. The methods are viewed as tools for solving certain ill-posed inverse problems, called the $\Pi$-problem and the $\Phi$-problem, respectively. Within the two classes of problems, probabilistic justification and interpretation of the respective methods are discussed.
1 $\Phi$-problem, MAP, MNPL

The $\Phi$-problem can be loosely stated as follows: there is a prior distribution over a non-parametric set $\Phi$ of data-sampling distributions and a sample from an unknown data-sampling distribution. The objective is to select a data-sampling distribution from the set $\Phi$, called the model.

More formally: Let $\mathcal{P}$ be the set of all probability mass functions$^1$ (pmf's) with finite support $\mathcal{X}$. The set $\mathcal{P}$ is endowed with the usual topology. Let $\Phi \subseteq \mathcal{P}$. Let $X_1^n = X_1, X_2, \ldots, X_n$ be an i.i.d. sample from pmf $r \in \mathcal{P}$. The 'true' sampling distribution $r$ need not be in $\Phi$; in other words, the model $\Phi$ might be misspecified. A strictly positive prior $\pi(\cdot)$ is put over $\Phi$. The objective in the $\Phi$-problem is to select a sampling distribution $q$ from $\Phi$, when the information summarized by $\{\mathcal{X}, X_1^n, \pi(\cdot), \Phi\}$ and nothing else is available.

Bayesian Maximum Probability method selects the Maximum A-Posteriori Probable (MAP) data-sampling distribution(s) $\hat{q}_{MAP} = \arg\sup_{q \in \Phi} \pi_n(q \mid X_1^n)$; there the posterior probability $\pi_n(q \mid X_1^n) \propto e^{-l_n(q)} \pi(q)$, and $l_n(q)$ is used to denote $-\sum_{i=1}^{n} \log q(X_i)$; 'log' is meant with the base $e$. Hence the standard abbreviation, MAP, for the method.

$^1$ For the sake of simplicity the presentation is restricted to the discrete case. The continuous case is treated in [ ].

The Bayesian Sanov Theorem (BST), through its corollary - the Bayesian Law of Large Numbers (BLLN) - provides a strong case for MAP as the correct method for solving the $\Phi$-problem. The theorems are Bayesian counterparts of the well-known Large Deviations (LD) theorems for empirical measures: the Sanov Theorem and the Conditional Law of Large Numbers (cf. [4] and Sect. 2). In order to state the theorems it is necessary to introduce the L-divergence $L(q \| p)$ of $q \in \mathcal{P}$ with respect to $p \in \mathcal{P}$: $L(q \| p) = -\sum_{\mathcal{X}} p \log q$. The L-projection $\hat{q}$ of $p$ on $Q \subseteq \mathcal{P}$ is $\hat{q} = \arg\inf_{q \in Q} L(q \| p)$. The value of L-divergence at an L-projection of $p$ on $Q$ is denoted by $L(Q \| p)$.
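The two definitions above can be checked numerically. The following is a minimal Python sketch, assuming pmf's are stored as lists over a common finite support and that $Q$ is a small finite list of candidates, so that the infimum can be found by enumeration. The names `l_divergence` and `l_projection` and the numbers are hypothetical and serve only as illustration.

```python
import math

def l_divergence(q, p):
    """L(q||p) = -sum_x p(x) log q(x), natural logarithm."""
    total = 0.0
    for qx, px in zip(q, p):
        if px > 0:
            if qx == 0:
                return math.inf  # q excludes an outcome that p allows
            total -= px * math.log(qx)
    return total

def l_projection(p, candidates):
    """L-projection of p on a finite candidate set (enumeration)."""
    return min(candidates, key=lambda q: l_divergence(q, p))

# Illustrative numbers only: r plays the role of p, `model` the role of Q.
r = [0.5, 0.3, 0.2]
model = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2], [0.2, 0.2, 0.6]]
q_hat = l_projection(r, model)
print(q_hat, l_divergence(q_hat, r))
```

Note that $L(q \| p)$ is smallest at $q = p$, where it equals the Shannon entropy of $p$; when $p$ lies outside the candidate set, as here, the L-projection is the candidate closest to it in this sense.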

Thm 1. (BST) Let $X_1^n$ be i.i.d. $r$. Let $Q \subset \Phi \subseteq \mathcal{P}$; $L(Q \| r) < \infty$. Then for $n \to \infty$, $\frac{1}{n} \log \pi_n(q \in Q \mid X_1^n) = -\{L(Q \| r) - L(\Phi \| r)\}$, a.s. $r^\infty$.

The posterior probability $\pi_n(Q \mid X_1^n)$ decays exponentially fast (a.s. $r^\infty$) with the decay rate $L(Q \| r) - L(\Phi \| r)$. For a proof see [13]. To the best of our knowledge Ben-Tal, Brown and Smith [1] were the first to use an LD reasoning in the Bayesian nonparametric setting. Ganesh and O'Connell [8] proved BST for the well-specified special case, i.e., $r \in \Phi$, by means of formal LD.

Thm 2. (BLLN) Let $\Phi \subseteq \mathcal{P}$ be a convex, closed set. Let $B(\hat{q}, \epsilon)$ be a closed $\epsilon$-ball defined by the total variation metric, centered at the L-projection $\hat{q}$ of $r$ on $\Phi$. Then, $\lim_{n \to \infty} \pi_n(q \in B(\hat{q}, \epsilon) \mid X_1^n) = 1$, a.s. $r^\infty$.

The BLLN is an extension of Freedman's Bayesian nonparametric consistency theorem [ ] to the case of a misspecified model. It shows that the posterior probability concentrates (a.s. $r^\infty$) on the L-projection of the 'true' sampling distribution $r$ on $\Phi$. For a book-length treatment of Bayesian non-parametric consistency see [9].
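The decay rate in the BST and the concentration in the BLLN can be illustrated on a toy case. The sketch below rests on strong simplifying assumptions that are not part of the theorems: the model is a finite list of three pmf's (so it is not convex), the prior is uniform, and instead of a random sample an idealized sample is used whose counts are exactly proportional to $r$. The function names are hypothetical.

```python
import math

def l_divergence(q, p):
    """L(q||p) = -sum_x p(x) log q(x); q is assumed strictly positive."""
    return -sum(px * math.log(qx) for qx, px in zip(q, p) if px > 0)

def posterior(model, prior, counts):
    """Posterior over a finite model, proportional to exp(-l_n(q)) * prior(q)."""
    log_w = [sum(c * math.log(qx) for c, qx in zip(counts, q)) + math.log(w)
             for q, w in zip(model, prior)]
    top = max(log_w)                       # subtract the max for stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

r = [0.5, 0.3, 0.2]                        # 'true' pmf, outside the model
model = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2], [0.2, 0.2, 0.6]]
prior = [1 / 3, 1 / 3, 1 / 3]

for n in (10, 100, 1000):
    counts = [round(n * rx) for rx in r]   # idealized sample with type r
    post = posterior(model, prior, counts)
    rate = math.log(post[0]) / n           # (1/n) log posterior of Q = {model[0]}
    print(n, [f"{p:.3g}" for p in post], rate)

# Rate predicted by the BST for Q = {model[0]}: -(L(Q||r) - L(Phi||r)).
predicted = -(l_divergence(model[0], r) - min(l_divergence(q, r) for q in model))
print(predicted)
```

As $n$ grows, the posterior mass moves to `model[1]`, the L-projection of `r` on this toy model, and the printed rate approaches the predicted value.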

MAP satisfies the BLLN. To see this, note that by the Strong Law of Large Numbers (SLLN), conditions for supremum of the posterior probability asymptotically turn into conditions for supremum of the negative of L-divergence. This also permits viewing the L-projections as asymptotic instances of MAP distributions $\hat{q}_{MAP}$.

There is also another method which satisfies the BLLN: Maximum Non-parametric Likelihood (MNPL). This can be shown by the above-mentioned recourse to the SLLN. MNPL selects $\hat{q}_{MNPL} = \arg\sup_{q \in \Phi} -l_n(q)$.
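The two selection rules can be compared side by side. The following is a minimal Python sketch, assuming a finite model (so that the supremum is a maximum over a list), outcomes coded as indices into the support, and strictly positive candidate pmf's; `map_select`, `mnpl_select` and the data are hypothetical.

```python
import math

def neg_log_lik(q, sample):
    """l_n(q) = -sum_i log q(X_i); `sample` holds indices into the support."""
    return -sum(math.log(q[x]) for x in sample)

def mnpl_select(model, sample):
    """MNPL: maximize -l_n(q) over the model."""
    return max(model, key=lambda q: -neg_log_lik(q, sample))

def map_select(model, prior, sample):
    """MAP: maximize log posterior = -l_n(q) + log prior(q), up to a constant."""
    pairs = list(zip(model, prior))
    return max(pairs, key=lambda qw: -neg_log_lik(qw[0], sample) + math.log(qw[1]))[0]

model = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2], [0.2, 0.2, 0.6]]
sample = [0, 0, 1, 0, 2, 1, 0, 1]          # illustrative data, n = 8

print(mnpl_select(model, sample))
print(map_select(model, [1 / 3, 1 / 3, 1 / 3], sample))
print(map_select(model, [0.8, 0.1, 0.1], sample))
```

Under a uniform prior the two rules coincide; with a small sample, an informative prior can still move the MAP choice, while for growing $n$ the likelihood term dominates.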

These two (up to trivial transformations) are the only methods for solving the $\Phi$-problem which comply with the BLLN; hence they are consistent in the well-specified as well as in the misspecified case. Selecting a sampling distribution by some other conceivable method would, in general, asymptotically select a sampling distribution which is a posteriori zero-probable. In this sense, selection of, say,
the posterior mean, or selection of argsup ^^ — ^^ 9 log ^, are ruled out. 

The $\Phi$-problem becomes more interesting when turned into a parametric setting. To this end, let $X$ be a random variable with pmf $r(x; \theta)$ parametrized by $\theta \in \Theta \subseteq \mathbb{R}^K$. Assume that a researcher is not willing to specify a parametric family $q(X; \theta)$ of data-sampling distributions, but is only willing to specify some of its underlying features. These features, i.e., the model $\Phi$, can be characterized by Estimating Equations (EE): $\Phi = \bigcup_{\theta \in \Theta} \Phi(\theta)$, where $\Phi(\theta) \triangleq \{q(x; \theta) : \sum_{\mathcal{X}} q(x; \theta) u_j(x; \theta) = 0, 1 \le j \le J\}$, $\theta \in \Theta \subseteq \mathbb{R}^K$. In the EE theory parlance, $u(\cdot)$ are the estimating functions, the number of which is in general different from the number $K$ of parameters $\theta$. The 'true' data-sampling distribution $r(x; \theta)$ need not belong to $\Phi$. A Bayesian puts a positive prior $\pi$ over $\Phi$, which in turn induces a prior $\pi(\theta)$ over $\Theta$; cf. [ ]. By the BLLN, the posterior $\pi_n(\cdot \mid X_1^n)$ concentrates on a weak neighborhood of the L-projection $\hat{q}$ of $r(x; \theta)$ on $\Phi$:

$\hat{q}(x; \hat{\theta}) = \arg\inf_{\theta \in \Theta} \inf_{q(x;\theta) \in \Phi(\theta)} L(q(x; \theta) \| r(x; \theta))$.

This thus provides a probabilistic justification for using $\hat{\theta}$ as an estimator of $\theta$. Thanks to the convex duality, the estimator $\hat{\theta}$ can be obtained also as $\hat{\theta} = \arg\sup_{\theta \in \Theta} \inf_{\lambda(\theta) \in \mathbb{R}^J} -\sum_{i=1}^{m} r(x_i; \theta) \log(1 - \sum_j \lambda_j(\theta) u_j(x_i; \theta))$. Since $r$ is in practice not known, following [19], one can estimate the convex dual objective function by $-\sum_{i=1}^{n} \log(1 - \sum_j \lambda_j(\theta) u_j(x_i; \theta))$. The resulting estimator is just the Empirical Likelihood (EL) estimator (cf. [ ], [24], [21]). It can be easily seen that EL satisfies the BLLN. The same is true for the Bayesian MAP estimator $\hat{q}_{MAP}(x; \hat{\theta}_{MAP}) = \arg\sup_{\theta \in \Theta} \sup_{q(x;\theta) \in \Phi(\theta)} \pi_n(q(x; \theta) \mid X_1^n)$. For further results and discussion see [15], [16].
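The sample version of the dual objective can be evaluated directly in a simple case. The sketch below is only an illustration under stated assumptions: a single estimating function $u(x; \theta) = x - \theta$ (so $J = K = 1$ and $\theta$ is a mean), a grid search over $\theta$ restricted to the interior of the sample range, and a damped Newton iteration for the inner problem. None of these choices comes from the surveyed works, and the function names are hypothetical.

```python
import math

def inner_sup(u, iters=100):
    """sup over lam of f(lam) = sum_i log(1 - lam*u_i), by damped Newton."""
    def f(l):
        return sum(math.log(1.0 - l * ui) for ui in u)

    lam = 0.0
    for _ in range(iters):
        grad = -sum(ui / (1.0 - lam * ui) for ui in u)
        if abs(grad) < 1e-12:
            break
        hess = -sum((ui / (1.0 - lam * ui)) ** 2 for ui in u)
        step = -grad / hess                # Newton step for a concave f
        t, cand = 1.0, lam + step
        # Halve the step until all weights stay positive and f does not drop.
        while not (all(1.0 - cand * ui > 0 for ui in u) and f(cand) >= f(lam)):
            t /= 2.0
            cand = lam + t * step
        lam = cand
    return f(lam)

def el_profile(theta, sample):
    """inf over lam of -sum_i log(1 - lam*(x_i - theta))."""
    return -inner_sup([x - theta for x in sample])

def el_estimate(sample, grid):
    """EL estimate of theta by maximizing the profile over a grid."""
    return max(grid, key=lambda th: el_profile(th, sample))

sample = [1.0, 2.0, 4.0, 5.0]              # illustrative data
grid = [1.5 + 0.5 * k for k in range(7)]   # 1.5, 2.0, ..., 4.5
print(el_estimate(sample, grid))
```

With this estimating function the profile equals zero exactly when the sample mean of $x_i - \theta$ vanishes and is negative otherwise, so the grid search returns the sample mean whenever it lies on the grid.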

2 $\Pi$-problem, MaxProb, REM

Unlike the $\Phi$-problem, the $\Pi$-problem is not a statistical problem. In the $\Pi$-problem, the sampling distribution $q$ is known, and there is a set $\Pi \subseteq \mathcal{P}$, to which an unavailable empirical pmf, drawn from $q$, is assumed to belong. The objective is to select an empirical pmf (also known as type, cf. [ ]) from the set $\Pi$. Thus, the $\Phi$- and $\Pi$-problems are opposite to each other.

More formally: let $\mathcal{X}$ be a set of $m$ elements. Type $\nu^n = [n_1, n_2, \ldots, n_m]/n$, where $n_i$ is the number of occurrences of the $i$-th element of $\mathcal{X}$ (i.e., outcome), $i = 1, 2, \ldots, m$, in a sample of size $n$, drawn from sampling distribution $q$. The objective in the $\Pi$-problem is to select a type(s) $\nu^n$ from $\Pi$, when the information summarized by $\{\mathcal{X}, q, n, \Pi\}$ and nothing else is available.

Maximum Probability (MaxProb) method (cf. [2], [29], [10]) selects the type $\hat{\nu}^n = \arg\sup_{\nu^n \in \Pi} \pi(\nu^n; q)$ which can be generated by the sampling distribution $q$ with the highest probability. If the sampling is i.i.d., then $\pi(\nu^n; q) = n! \prod_{i=1}^{m} \frac{q_i^{n_i}}{n_i!}$. Niven [22] expanded MaxProb into non-i.i.d. and combinatorial settings; see also [23], [29], [14].
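For small $n$ the MaxProb type can be found by brute force. The following is a minimal Python sketch for the i.i.d. case, assuming a three-element support with outcome values 1, 2, 3, a uniform $q$, and a feasible set $\Pi$ given by a mean-value constraint. These choices, and the names `type_probability`, `maxprob` and `mean_is_2_5`, are hypothetical.

```python
import math
from itertools import product

def type_probability(counts, q):
    """pi(nu^n; q) = n! * prod_i q_i^{n_i} / n_i!  for i.i.d. sampling."""
    coef = math.factorial(sum(counts))
    for c in counts:
        coef //= math.factorial(c)         # multinomial coefficient, exact
    return coef * math.prod(qi ** c for qi, c in zip(q, counts))

def maxprob(n, q, feasible):
    """Enumerate all count vectors of size n and keep the most probable feasible one."""
    candidates = (c for c in product(range(n + 1), repeat=len(q))
                  if sum(c) == n and feasible(c))
    return max(candidates, key=lambda c: type_probability(c, q))

values = (1, 2, 3)
q = (1 / 3, 1 / 3, 1 / 3)
n = 6

def mean_is_2_5(counts):
    # Integer form of: sum_i values_i * n_i / n == 2.5
    return 2 * sum(v * k for v, k in zip(values, counts)) == 5 * sum(counts)

best = maxprob(n, q, mean_is_2_5)
print(best, type_probability(best, q))
```

Here only two count vectors are feasible, (1, 1, 4) and (0, 3, 3), with multinomial coefficients 30 and 20, so MaxProb selects the type [1, 1, 4]/6.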

The Sanov Theorem (ST) (cf. [26], [ ]), through its corollary - the Conditional Law of Large Numbers (CLLN) (cf. [28], [27], [ ]) - provides a probabilistic justification for application of MaxProb in the i.i.d. instance of the $\Pi$-problem. The ST identifies the exponential decay rate function as the I-divergence $I(p \| q) = \sum_{\mathcal{X}} p \log \frac{p}{q}$, $p, q \in \mathcal{P}$. The I-projection $\hat{p}$ of $q$ on $\Pi \subseteq \mathcal{P}$ is $\hat{p} = \arg\inf_{p \in \Pi} I(p \| q)$. The value of the I-divergence at an I-projection of $q$ on $\Pi$ is denoted by $I(\Pi \| q)$.

Thm 3. (ST) Let $\Pi$ be an open set; $I(\Pi \| q) < \infty$. Then, for $n \to \infty$, $\frac{1}{n} \log \pi(\nu^n \in \Pi; q) = -I(\Pi \| q)$.

The rate of the exponential convergence of the probability $\pi(\nu^n \in \Pi; q)$ towards zero is determined by the information divergence at (any of) the I-projection(s) of $q$ on $\Pi$.



Thm 4. (CLLN) Let $\Pi$ be a convex, closed set that does not contain $q$. Let $B(\hat{p}, \epsilon)$ be a closed $\epsilon$-ball defined by the total variation metric that is centered at the I-projection $\hat{p}$ of $q$ on $\Pi$. Then, $\lim_{n \to \infty} \pi(\nu^n \in B(\hat{p}, \epsilon) \mid \nu^n \in \Pi; q) = 1$.

Given that a type from $\Pi$ was observed, it is asymptotically zero-probable that the type was different from the I-projection of the sampling distribution $q$ on $\Pi$.

It is straightforward to see that MaxProb satisfies CLLN. Indeed, the set of MaxProb types converges to the set of I-projections, as $n \to \infty$; cf. [11], [10]. The Relative Entropy Maximization method (REM/MaxEnt), which maximizes, with respect to $p$, the negative of the I-divergence (a.k.a. relative entropy), thus can be viewed as an asymptotic form of the MaxProb method.
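The convergence of MaxProb types to the I-projection can be observed numerically. The sketch below assumes a three-element support with values 1, 2, 3, a uniform $q$, and a feasible set defined by the mean-value constraint $\sum_i x_i p_i = 2.5$. It also uses the standard exponential-tilting form of the I-projection under a linear constraint, solved by bisection, which is not spelled out in the text above. All names are hypothetical.

```python
import math

def log_type_probability(counts, q):
    """log of n! * prod_i q_i^{n_i} / n_i!, via lgamma to avoid overflow."""
    out = math.lgamma(sum(counts) + 1)
    for c, qi in zip(counts, q):
        out += c * math.log(qi) - math.lgamma(c + 1)
    return out

def maxprob_mean(n, q, values, total):
    """MaxProb counts among 3-outcome types with sum_i values_i * n_i == total."""
    best, best_lp = None, -math.inf
    for n1 in range(n + 1):
        for n2 in range(n + 1 - n1):
            c = (n1, n2, n - n1 - n2)
            if sum(v * k for v, k in zip(values, c)) != total:
                continue
            lp = log_type_probability(c, q)
            if lp > best_lp:
                best, best_lp = c, lp
    return best

def i_projection_mean(q, values, mean, lo=-50.0, hi=50.0):
    """I-projection of q on {p : sum_i p_i * values_i = mean}; p_i ~ q_i * exp(lam * x_i)."""
    def tilted(lam):
        w = [qi * math.exp(lam * v) for qi, v in zip(q, values)]
        s = sum(w)
        return [wi / s for wi in w]
    for _ in range(200):                   # bisection: the tilted mean increases in lam
        mid = (lo + hi) / 2
        if sum(pi * v for pi, v in zip(tilted(mid), values)) < mean:
            lo = mid
        else:
            hi = mid
    return tilted((lo + hi) / 2)

values, q = (1, 2, 3), (1 / 3, 1 / 3, 1 / 3)
p_hat = i_projection_mean(q, values, 2.5)
print(p_hat)
for n in (6, 60, 600):
    counts = maxprob_mean(n, q, values, 5 * n // 2)
    gap = max(abs(c / n - p) for c, p in zip(counts, p_hat))
    print(n, counts, gap)
```

In this example the largest coordinate-wise gap between the MaxProb type and the I-projection shrinks as $n$ grows.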

Still, it is possible to solve the $\Pi$-problem by selecting the type(s) with the highest value of relative entropy; in other words, to view REM as a self-standing method for solving the $\Pi$-problem, rather than as an asymptotic instance of MaxProb. Obviously, REM satisfies CLLN.

MaxProb and REM/MaxEnt are the only two methods which satisfy CLLN. Selection of the mean type, which was proposed under the name ExpOc in [10], or selection of, say, the type with the highest value of Tsallis entropy, would, in general, violate CLLN.

The $\Pi$-problem originated in Statistical Physics, where $\Pi$ is formed by a mean energy constraint; see [ ]. In [12] a feasible set of types formed by interval observations was considered.

Estimating Equations can be used to expand the $\Pi$-problem into a parametric setting. This time, the EE define a feasible set $\Pi$ to which an unobserved parametrized type $\nu^n(\theta)$ is supposed to belong: $\Pi = \bigcup_{\theta \in \Theta} \Pi(\theta)$, where $\Pi(\theta) = \{p(x; \theta) : \sum_{\mathcal{X}} p(x; \theta) u_j(x; \theta) = 0, 1 \le j \le J\}$, $\theta \in \Theta \subseteq \mathbb{R}^K$. The true data-sampling distribution $r(x; \theta)$ need not belong to $\Pi$. The parametric $\Pi$-problem is framed by the information $\{\mathcal{X}, r, n, \Pi(\Theta), \Theta\}$, and the objective is now to select a parametric type $\nu^n(\theta)$ from $\Pi$. CLLN implies (cf. [20]) that the parametric $\Pi$-problem should be (for $n \to \infty$) solved by selecting

$\hat{p}(x; \hat{\theta}) = \arg\inf_{\theta \in \Theta} \inf_{p(x;\theta) \in \Pi(\theta)} I(p(x; \theta) \| r(x; \theta))$.

Thanks to the convex duality, the estimator $\hat{\theta}$ can equivalently be obtained as $\hat{\theta} = \arg\sup_{\theta \in \Theta} \inf_{\lambda(\theta) \in \mathbb{R}^J} \log \sum_{i=1}^{m} r(x_i; \theta) \exp(-\sum_{j=1}^{J} \lambda_j(\theta) u_j(x_i; \theta))$. The estimator is known as the Maximum Maximum Entropy (MaxMaxEnt) estimator. The parametric $\Pi$-problem can be made more realistic by assuming that a sample of size $N$ is available to a modeler. Kitamura and Stutzer [ ] suggested using the sample to estimate the convex dual objective function by its sample analogue $\log \sum_{i=1}^{N} \exp(-\sum_{j=1}^{J} \lambda_j u_j(x_i; \theta))$. The resulting method is known as the Empirical Maximum Maximum Entropy (EMME) method, or Maximum Entropy Empirical Likelihood (cf. [17], [19], [21], [18]).
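The sample analogue of the dual objective can be evaluated in the same toy setting as a check. The sketch below is an illustration under stated assumptions, not the procedure of the cited works: a single estimating function $u(x; \theta) = x - \theta$, a grid search over $\theta$ inside the sample range, and a damped Newton iteration for the inner infimum over a scalar $\lambda$. All names are hypothetical.

```python
import math

def dual(lam, u):
    """g(lam) = log sum_i exp(-lam * u_i), computed stably."""
    a = [-lam * ui for ui in u]
    top = max(a)
    return top + math.log(sum(math.exp(ai - top) for ai in a))

def inner_inf(u, iters=100):
    """inf over lam of g(lam); g is convex, g' = -mean_w(u), g'' = var_w(u)."""
    lam = 0.0
    for _ in range(iters):
        a = [-lam * ui for ui in u]
        top = max(a)
        w = [math.exp(ai - top) for ai in a]
        s = sum(w)
        w = [wi / s for wi in w]           # tilted weights at the current lam
        mean_u = sum(wi * ui for wi, ui in zip(w, u))
        var_u = sum(wi * (ui - mean_u) ** 2 for wi, ui in zip(w, u))
        if abs(mean_u) < 1e-12 or var_u == 0.0:
            break
        step = mean_u / var_u              # Newton step: -g'/g''
        t, current = 1.0, dual(lam, u)
        while t > 1e-12 and dual(lam + t * step, u) > current:
            t /= 2.0                       # backtrack until g does not increase
        if t <= 1e-12:
            break
        lam += t * step
    return dual(lam, u)

def emme_estimate(sample, grid):
    """EMME estimate of theta: maximize the profiled dual over a grid."""
    return max(grid, key=lambda th: inner_inf([x - th for x in sample]))

sample = [1.0, 2.0, 4.0, 5.0]              # illustrative data, N = 4
grid = [1.5 + 0.5 * k for k in range(7)]   # 1.5, 2.0, ..., 4.5
print(emme_estimate(sample, grid))
```

For this estimating function the profiled dual attains its largest possible value, $\log N$, exactly at the sample mean, so the grid search returns 3.0 here, in agreement with the EL estimate for the same data.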